ABSTRACT

C2H2 zinc fingers (C2H2-ZFs) are the most prevalent 
type of vertebrate DNA-binding domain, and typic- 
ally appear in tandem arrays (ZFAs), with sequential 
C2H2-ZFs each contacting three (or more) sequen- 
tial bases. C2H2-ZFs can be assembled in a modular 
fashion, providing one explanation for their remark- 
able evolutionary success. Given a set of modules 
with defined three-base specificities, modular 
assembly also presents a way to construct artificial 
proteins with specific DNA-binding preferences. 
However, a recent survey of a large number of 
three-finger ZFAs engineered by modular assembly 
reported high failure rates (~70%), casting doubt on
the generality of modular assembly. Here, we used 
protein-binding microarrays to analyze 28 ZFAs that 
failed in the aforementioned study. Most (17) 
preferred specific sequences, which in all but one 
case resembled the intended target sequence. Like 
natural ZFAs, the engineered ZFAs typically yielded 
degenerate motifs, binding dozens to hundreds of 
related individual sequences. Thus, the failure of 
these proteins in previous assays is not due to 
lack of sequence-specific DNA-binding activity. 
Our findings underscore the relevance of individual 
C2H2-ZF sequence specificities within tandem 
arrays, and support the general ability of modular 
assembly to produce ZFAs with sequence-specific 
DNA-binding activity. 

INTRODUCTION 

The C2H2 zinc finger (C2H2-ZF) is among the most 
prevalent DNA-binding domains in eukaryotes, and 
genes that encode this domain constitute nearly one-half 
of all known and predicted transcription factors in human 
and mouse (1-5). C2H2-ZF proteins typically have
multiple C2H2-ZFs arranged in tandem, with each
C2H2-ZF binding 3 (or more) bases, and with the 
fingers offset by three bases, so that a multi-fingered 
protein recognizes a longer DNA sequence that is 
thought to be largely a concatenation of each finger's spe- 
cificity (6). The dramatic expansion of the number of 
C2H2-ZFs in mammals appears to be a recent evolution- 
ary event, with their loci residing in clusters, indicating 
that the C2H2-ZF family evolved through tandem dupli- 
cations (2,3,7). The C2H2-ZF family is known to have 
remarkably diverse sequence specificity (6), and sequence 
analyses have suggested that the diversification of 
C2H2-ZF paralogs may be driven by positive selection 
on DNA-contacting residues (2,8). 

The evolutionary success of C2H2-ZFs may also be ex- 
plained in part by their capacity for modular assembly: 
individual C2H2-ZFs ('modules') can be recombined to 
produce proteins (Zinc Finger Arrays, or ZFAs) with 
new binding specificities, and both natural and artificial 
C2H2-ZFs have been used successfully in modular 
assembly of ZFAs with new sequence specificities (9,10) 
[reviewed in (6,11,12)]. Modular assembly of ZFAs has 
received much attention because of its utility in engineer- 
ing artificial transcription factors or zinc-finger nucleases 
(ZFNs) with desired sequence specificity: for example, 
ZFNs constructed by modular assembly have been used 
to successfully make targeted genome modifications in 
both plants and animals (13). It is also reasonable to 
posit that modular assembly serves as a mechanism for 
natural evolutionary diversification of C2H2-ZF proteins 
(14). In addition, modularity is an assumption that under- 
lies efforts to identify the sequence specificity of the thou- 
sands of natural ZFAs — most of which have not been 
experimentally characterized — by concatenating the 
known or predicted sequence specificities of their individ- 
ual C2H2-ZF components (15-17). 
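
The concatenation assumption described above can be illustrated with a minimal Python sketch. The triplets in the example are hypothetical placeholders, not modules from any published archive. The sketch also assumes the usual convention that finger 1 (N-terminal) contacts the 3'-most triplet, so the predicted site, written 5' to 3', is the F3, F2 and F1 triplets in that order.

```python
def predict_target(finger_triplets):
    """Concatenate per-finger triplets (listed F1..Fn, N- to C-terminal)
    into a predicted binding site written 5'->3'.

    Assumes strict modularity: each finger reads exactly three bases and
    neighboring fingers do not influence one another.
    """
    for triplet in finger_triplets:
        if len(triplet) != 3 or set(triplet.upper()) - set("ACGTN"):
            raise ValueError(f"not a DNA triplet: {triplet!r}")
    # F1 contacts the 3'-most triplet, so the finger order is reversed.
    return "".join(t.upper() for t in reversed(finger_triplets))


# Hypothetical three-finger array (F1, F2, F3).
example_fingers = ["GCG", "TGG", "GAT"]
print(predict_target(example_fingers))  # GATTGGGCG
```

Such a prediction is only a starting point: target site overlap and context effects between neighboring fingers are exactly what a strict concatenation ignores.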

Given the conceptual and practical importance of the 
modularity of C2H2-ZFs, it is important to know the 
limits and constraints of modular assembly, and in this 
regard the evidence is mixed. While there are many 
examples supporting the retention of sequence specificity
of individual C2H2-ZFs within ZFAs constructed by 
modular assembly [e.g. (6,11,12,18)], it is also known 
that the sequences recognized by a given C2H2-ZF can 
be influenced by the neighboring C2H2-ZF (19,20). The 
most straightforward explanation for dependence among 
neighboring C2H2-ZFs has been referred to as the 'target 
site overlap problem' (21): C2H2-ZFs often contact 
four-base subsites, such that there is one base of overlap 
between adjacent C2H2-ZFs (22,23). Alternative docking 
modes and contacts of up to five bases have also been 
observed (6,24). Interactions between side-chains also 
occur between sequential C2H2-ZFs and may be import- 
ant for both stability of the DNA-protein complex and 
for sequence specificity (24). Moreover, the spacing 
between adjacent C2H2-ZFs is not precisely equivalent 
to three bases [discussed in (25)], raising the possibility 
that interactions between adjacent C2H2-ZFs may 
impact the alignment of individual C2H2-ZFs with their 
subsites. 

A recent large-scale examination of modular assembly, 
hereafter referred to as Ramirez et al. (26), concluded that 
the modular assembly method of engineering ZFAs has an 
unexpectedly high failure rate of roughly 70%, in contrast 
to previous reports claiming 60% or 100% success (9,18). 
Ramirez et al. constructed a total of 204 ZFAs using three 
different collections of C2H2-ZF modules (9,27-29). The 
study tested 27 ZFAs by electrophoretic mobility shift 
assay (EMSA), among which seven succeeded. A subset 
of these failed ZFAs was then tested by a plant 
single-stranded annealing assay; all of these also failed. 
The study then tested 168 additional ZFAs by a 
bacterial-2-hybrid (B2H) assay, which tests a ZFA's 
ability to activate a reporter gene containing the 
intended ZFA binding site in the promoter, and 
obtained only 53 successes. Twenty-two of these ZFAs 
were tested by an episomal recombination assay, which 
supported the results of the B2H assays. In total, 144 of 
204 ZFAs failed at the assay(s) used to test them. 

Ramirez et al. found that much of the discrepancy 
between their findings and previous reports (9,18) can be 
accounted for by the fact that the previous reports were 
biased toward GNN subsites (i.e. the C2H2-ZF modules 
bound to sequences in which the 5'-base is a guanine). 
There are at least two reasons to expect a higher success 
rate with GNN subsites. First, in GNN-binding 
C2H2-ZFs, the amino acid Arg is typically found at 
position +6 of the recognition helix (which directly 
contacts the bases in the major groove), and Arg can 
make two hydrogen bonds with the 5'-base guanine, 
creating a particularly strong DNA-protein interaction 
(22). Second, GNN subsites may be the most compatible 
with the scaffolds used in current artificial ZFAs because 
many of the individual C2H2-ZF modules are variants of 
finger 2 of Zif268 (30-32), which naturally prefers 
GGG-G or TGG-G (the fourth base is a contact to the 
next triplet, which would further bias the neighboring 
triplet toward GNN). Other modules are derived 
from fingers 1, 2 or 3 of Sp1, which naturally prefer
GG(G/T), G(C/A)G and (G/T)GG, respectively (33). 
Indeed, Ramirez et al. obtained 59% success for ZFAs
with three GNN subsites, but only 29, 12 and 0%
success for ZFAs with 2, 1 and 0 GNN subsites, respectively.
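
As a small illustration of how a target site is classified by its GNN content, the following Python sketch counts GNN subsites in a 9-mer and looks up the corresponding success rate reported by Ramirez et al. The function name and the example site are hypothetical; only the four percentages come from the text above.

```python
# Success rates (%) reported by Ramirez et al., keyed by number of GNN subsites.
RAMIREZ_SUCCESS_PERCENT = {3: 59, 2: 29, 1: 12, 0: 0}


def count_gnn_subsites(target):
    """Count the triplet subsites of a target whose 5' base is G."""
    target = target.upper()
    if len(target) % 3 != 0:
        raise ValueError("target length must be a multiple of 3")
    triplets = [target[i:i + 3] for i in range(0, len(target), 3)]
    return sum(1 for t in triplets if t.startswith("G"))


site = "GATTGGGCG"  # hypothetical 9-mer: GAT, TGG, GCG
n_gnn = count_gnn_subsites(site)
print(n_gnn, RAMIREZ_SUCCESS_PERCENT[n_gnn])  # 2 29
```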

The high failure rates observed by Ramirez et al. call 
into question the general modularity of the C2H2-ZF 
motif. However, Ramirez et al. were seeking ZFAs that 
would function in specific assays, and in most cases did 
not directly assay DNA-binding: only a minority (27, or 
13%) were tested by EMSA. Moreover, the assays tested 
only the single anticipated 9-mer target. High specificity 
and/or affinity may be a requirement for ZFNs (and for 
the B2H assay) (34,35), but is not necessarily a constraint 
for the evolution of natural transcription factors; most 
transcription factors display degeneracy at multiple 
bases of the binding site (36). In fact, if recombination 
among C2H2-ZFs is used as an evolutionary mechanism 
for the generation of novel TFs, as has been previously 
proposed (14), one can imagine that flexibility and degen- 
eracy in the binding preferences of modular C2H2-ZFs 
could be beneficial for creating new DNA-binding 
activities. Analysis of useful engineered ZFAs by 
SELEX has also suggested degeneracy at some base pos- 
itions (18,37-39). Given these considerations, the blanket 
declaration that modular assembly generally fails may 
require qualification, since success and failure are depend- 
ent on the assays used and the goals of individual re- 
searchers. For example, modular assembly of a new 
ZFA with sequence-specific DNA-binding activity might 
be considered a 'success' by evolutionary biologists, and 
indeed many molecular biologists, even if the sequence 
preference contains degeneracy, or is otherwise not 
exactly what would have been predicted from the constitu- 
ent modules. Moreover, to our knowledge, the general 
concept of modularity does not require invariant 
behavior of modules in different contexts. Rather, it 
simply requires that the individual modules can function 
in different contexts. 

Here, we have more closely examined the DNA-binding 
specificities of 28 of the 'failed' ZFAs from Ramirez et al., 
using protein-binding microarrays (PBMs). PBMs have 
emerged in the last decade as a rapid and powerful tool 
for the analysis of sequence specificity of diverse proteins, 
including C2H2-ZFs (40). The PBM technique can be 
summarized as follows: a tagged DNA-binding protein 
is 'hybridized' to a microarray that contains a diverse set 
of approximately 41 000 35-mer probes, and subsequent 
addition of a fluorescently tagged antibody reveals the 
DNA sequences that the protein has bound, and to what 
degree. The DNA probes are designed such that all 
possible 10-mers are present once and only once; thus, 
all non-palindromic 8-mers are present 32 times, 
allowing for a robust and unbiased assessment of 
sequence preference to all possible 8-mers, and inference 
of DNA-binding motifs up to 14 bases wide (36,41,42). 
We and others have used PBMs to determine the 
binding specificities of hundreds of different transcription 
factors, from a wide range of species, with very little dis- 
crepancy between motifs obtained by PBM and motifs 
previously defined by more traditional methods, when 
available (36,41,43-47). In fact, JASPAR (48), an
open-access database for high-quality transcription
factor binding site information, currently has more data
derived from PBM experiments than it has from all other
data in the literature.

In summary, for the failed ZFAs of Ramirez et al., 
PBM analysis reveals that most have sequence preferences 
similar to those intended. In addition, most of the individ- 
ual modules within functional ZFAs bind sequences that 
are identical or related to their known targets. Our 
analysis does recapitulate the bias toward GNN subsites. 
However, we conclude that the high failure rates observed 
by Ramirez et al. do not reflect a general failure of 
modular assembly to produce ZFAs with 
sequence-specific DNA-binding activity. 



MATERIALS AND METHODS 

Protein-binding microarray experiments 

Sequences of the two PBM 'all-10-mer' designs are given
at http://hugheslab.ccbr.utoronto.ca/supplementary-data/C2H2_modularity/.
Details of the design and use of PBMs have been described
elsewhere (41,47,49,50). Plasmids are listed in Supplementary
Table S1. ZFAs were cloned as SacI-BamHI fragments into
pTH5325, a modified T7-driven GST expression vector (see
Supplementary Document of the Supplementary Data). Briefly,
we used 150 ng of plasmid DNA in a 25 µl in vitro
transcription/translation reaction using a PURExpress In Vitro
Protein Synthesis Kit (New England BioLabs) supplemented with
RNase inhibitor and 50 µM zinc acetate. After a 2-h incubation
at 37°C, 12.5 µl of the mix was added to 137.5 µl of
protein-binding solution for a final mix of PBS/2% skim
milk/0.2 mg per ml BSA/50 µM zinc acetate/0.1%
Tween-20. This mixture was added to an array previously
blocked with PBS/2% skim milk and washed once with
PBS/0.1% Tween-20 and once with PBS/0.01% Triton X-100.
After a 1-h incubation at room temperature, the array
was washed once with PBS/0.5% Tween-20/50 µM zinc
acetate and once with PBS/0.01% Triton X-100/50 µM
zinc acetate. Cy5-labeled anti-GST antibody was added,
diluted in PBS/2% skim milk/50 µM zinc acetate. After a
1-h incubation at room temperature, the array was washed
three times with PBS/0.05% Tween-20/50 µM zinc acetate
and once with PBS/50 µM zinc acetate. The array was then
imaged using an Agilent microarray scanner at 2 µm
resolution.

Analysis of microarray data 

Image spot intensities were quantified using ImaGene
software (BioDiscovery). To estimate the relative preference
for each 8-mer, two different scores were calculated:
the Z-score was calculated from the average signal intensity
across the 16 or 32 spots containing each 8-mer; the
'E-score' (for enrichment) is a variation on the area under
the ROC curve (41) and is used here because it is highly
reproducible and facilitates comparison between separate
experiments. Each ZFA was tested on two different
universal microarrays (designated ME and HK). E-score
data are discussed in the text; however, both Z- and E-score
data are provided in the supplementary data online
at http://hugheslab.ccbr.utoronto.ca/supplementary-data/C2H2_modularity/.
Microarray data have been deposited
to GEO (accession number GSE25723).
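
To make the idea of an AUC-like enrichment score concrete, here is a minimal Python sketch. It is not the E-score implementation used in this study: the exact statistic is defined in ref. (41) and is not reproduced here. The sketch uses the plain area under the ROC curve minus 0.5, which shares the -0.5 to 0.5 range. The function names and the toy probe intensities are hypothetical, and the sketch assumes that a probe counts as containing an 8-mer if it carries the 8-mer in either orientation.

```python
def revcomp(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def auc_minus_half(foreground, background):
    """AUC - 0.5: probability that a foreground intensity exceeds a
    background intensity (ties count one half), shifted to [-0.5, 0.5]."""
    if not foreground or not background:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for f in foreground:
        for b in background:
            if f > b:
                wins += 1.0
            elif f == b:
                wins += 0.5
    return wins / (len(foreground) * len(background)) - 0.5


def e_like_score(kmer, probes):
    """Split probes (sequence -> intensity) by whether they contain the
    k-mer on either strand, then compare the two intensity groups."""
    rc = revcomp(kmer)
    fg = [v for s, v in probes.items() if kmer in s or rc in s]
    bg = [v for s, v in probes.items() if not (kmer in s or rc in s)]
    return auc_minus_half(fg, bg)


# Toy data: two bright probes contain TGACGTCA, three dim probes do not.
toy_probes = {
    "AATGACGTCATT": 9.0,
    "CCTGACGTCAGG": 8.0,
    "AAAAAAAAAAAA": 1.0,
    "CCCCCCCCCCCC": 2.0,
    "ACACACACACAC": 1.5,
}
print(e_like_score("TGACGTCA", toy_probes))  # 0.5
```

A rank-based score of this kind depends only on the ordering of probe intensities, which is one reason such scores are comparable between separate experiments.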

RESULTS 

Analysis of the sequence specificity of ZFAs 

Using PBMs, we assayed a total of 31 ZFAs, 28 of which 
were designated as failures by Ramirez et al. and three 
that were deemed successes, which we used as positive 
controls (Supplementary Table S1 contains information
about the ZFAs we tested; the Supplementary
Document gives the sequence and map of the plasmid
we used; Supplementary Table S1 and all of the data
can be found online at
http://hugheslab.ccbr.utoronto.ca/supplementary-data/C2H2_modularity/). We chose
the 28 ZFAs such that (i) 20 modules (of a total of 61 in 
our study) were tested in more than one context; (ii) the 
DNA triplets that the encompassed modules specified 
formed a diverse set, including GNN, CNN, ANN and 
TNN modules; (iii) the modules included both human
C2H2-ZFs [Toolgen modules (9)] and C2H2-ZFs 
obtained by selection methods [Barbas (28) and 
Sangamo (27,29) modules] and (iv) 10 ZFAs that failed 
by EMSA in Ramirez et al. were included. We cloned each 
of the inserts into a GST expression vector and analyzed 
each of the proteins on two different PBM arrays, i.e. 
different designs, such that the 10-mers, and hence 
8-mers, are in different contexts between the two arrays 
(the arrays are designated 'ME' and 'HK', which are 
the initials of the designers of the arrays). We obtained 
essentially identical results from the two array types. 

PBM data can be represented in several ways (41,47),
including motifs and consensus sequences, as well as a
table of relative preferences for individual sequences,
most typically all 32 896 possible 8-mers (collapsing
reverse complements). Berger et al. (47) previously
established a threshold for statistical significance based on
8-mer 'E-scores': in essence, a score that reflects the
relative ranking of the intensities of the 32 probes that
contain each 8-mer, relative to the remaining
approximately 41 000 probes. E-scores are similar to the
AUC (area under the ROC curve) statistical metric and
range from -0.5 to 0.5. Permutation tests in which the
identity of the array probes is scrambled have shown
that any score at or above 0.45 would not be observed
by chance in a data set much larger than the one used
here (47). Using a success criterion that at least one
8-mer must have an E-score of 0.45 or greater, all three
of the control proteins were successes, as were 17 of the 28
proteins that failed in Ramirez et al. For the remaining 11,
it is possible that these proteins simply lack DNA-binding
activity. However, it is also possible that the proteins are
misfolded; in our hands, heterologous expression of
natural transcription factor DNA-binding domains as
GST fusions yields an overall success rate of ~50% for
obtaining a soluble protein with sequence-specific
DNA-binding activity (data not shown). Notably, using the
E > 0.45 criterion, all six of the ZFAs we assayed that
were constructed from natural human C2H2-ZF
modules were successful (see below), consistent with a
previous claim that naturally occurring human C2H2-ZFs
have a high propensity to form functional ZFAs (51),
although in our analysis their sequence specificity
appears no higher than that of other modules
(see below). Figure 1 shows a clustering analysis of all of
the 8-mers with E > 0.45 in at least one experiment,
illustrating that each ZFA has a distinct and reproducible
spectrum of preferences for individual 8-mers.
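
The figure of 32 896 8-mers (and the 131 072 9-mers used for ranking intended targets) follows from treating a sequence and its reverse complement as the same site. The following Python sketch, with hypothetical function names, computes the count in closed form and checks it by brute-force enumeration for small k.

```python
from itertools import product


def revcomp(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def count_strand_collapsed_kmers(k):
    """Number of distinct k-mers when reverse complements are merged.

    Reverse-complement palindromes exist only for even k (4**(k/2) of
    them) and pair with themselves; all other k-mers pair up in twos.
    """
    total = 4 ** k
    palindromes = 4 ** (k // 2) if k % 2 == 0 else 0
    return (total + palindromes) // 2


def brute_force_count(k):
    """Enumerate all k-mers and keep one representative per strand pair."""
    seen = set()
    for letters in product("ACGT", repeat=k):
        s = "".join(letters)
        seen.add(min(s, revcomp(s)))
    return len(seen)


print(count_strand_collapsed_kmers(8))  # 32896
print(count_strand_collapsed_kmers(9))  # 131072
```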



ZFA sequence preferences typically resemble intended 
targets 

We next asked whether the sequence specificities we
obtained corresponded to those intended. Since the
ZFAs were designed to recognize 9-base sites, we first
examined how the intended target ranked among all
131 072 possible 9-mers, using the same E-score statistic
described above. The 9-mer scores are noisier than the
8-mer scores because they are based on a smaller
number of probes and the threshold for statistical
significance has not been explored as it has been for 8-mers;
nonetheless, we observed that the intended 9-mer ranked
very highly (above the 99.9th percentile, or top 131, of all
9-mers, on both arrays) in most cases (13/20, including
positive controls). For example, for all three of the
positive control proteins (ZFA15, ZFA45 and ZFA93),
the intended target is within the top 12 most highly
ranked 9-mers for both array types (Figure 2). Among
the 17 ZFAs that failed for Ramirez et al. but succeeded
in the PBM assays, six of them (ZFA1, 5, 8, 10, 24 and
152) recognized the intended sequence with similar
precision (within the top 12) (Figure 2), while others appear
to prefer many other sequences more highly than the
intended 9-mer target. For five ZFAs (4, 7, 57, 75 and
188), the intended 9-mer target did not appear among
the top 100 9-mers on either array (Figure 2).
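
The ranking criterion can be sketched in a few lines of Python. The function names and the toy scores are hypothetical; the only values taken from the text are the 131 072 9-mers and the 99.9th percentile (top 0.1%) cutoff. Ties are handled here by counting only strictly higher scores, which is an assumption of this sketch rather than a documented choice of the study.

```python
def top_n_cutoff(n_total, top_fraction):
    """Number of k-mers that fall within the top fraction of a ranking."""
    return int(n_total * top_fraction)


def rank_of(target, scores):
    """1-based rank of target among scores (k-mer -> score), where rank 1
    is the highest score. Ties share the best available rank."""
    if target not in scores:
        raise KeyError(target)
    target_score = scores[target]
    return 1 + sum(1 for s in scores.values() if s > target_score)


print(top_n_cutoff(131072, 0.001))  # 131

# Toy scores for three 9-mers (made-up values).
toy_scores = {"GATTGGGCG": 0.49, "GATTGGGCT": 0.48, "AAAAAAAAA": -0.10}
print(rank_of("GATTGGGCT", toy_scores))  # 2
```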

We also created motifs by aligning the 10 8-mers with
the highest E-scores (or fewer than 10, since we only
included 8-mers with E-scores at or above 0.45; we used
8-mers in order to take advantage of the E-score cutoff)
(Figure 2; the Supplementary Document of the Supplementary
Data gives the full alignments). Consistent with the results
of the 9-mer analysis above, this procedure produced motifs
resembling the intended targets for all three of the
positive control ZFAs, and also for most of the ZFAs
that failed in Ramirez et al. Indeed, the motifs produced
could be easily aligned to the intended 9-mer target in all
but one case (ZFA188, which we re-sequenced and
re-analyzed twice, and obtained essentially identical
results). However, it is also evident that there are many
cases in which individual C2H2-ZF modules do not
behave precisely as intended, including examples of
degeneracy or even unanticipated specificity. This is true even
for the positive controls, e.g. F1 of ZFA15, F2 and F3 of
ZFA45 and F1 of ZFA93 all display nearly complete
degeneracy for at least one base position.
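
As a rough illustration of how a motif can be summarized from a handful of aligned 8-mers, the Python sketch below tallies base counts per alignment column and reports a simple consensus. It assumes the 8-mers have already been aligned and padded with '-' to a common width; the alignments in this study were made manually. The example sequences and the 0.6 threshold are made up for illustration.

```python
from collections import Counter


def position_counts(aligned):
    """Per-column base counts for equal-length, gap-padded sequences."""
    width = len(aligned[0])
    if any(len(s) != width for s in aligned):
        raise ValueError("all aligned sequences must have the same length")
    columns = []
    for i in range(width):
        counts = Counter(s[i] for s in aligned if s[i] in "ACGT")
        columns.append({base: counts.get(base, 0) for base in "ACGT"})
    return columns


def consensus(columns, min_fraction=0.6):
    """Report a base where it reaches min_fraction of a column, else N."""
    out = []
    for col in columns:
        total = sum(col.values())
        base, n = max(col.items(), key=lambda kv: kv[1])
        out.append(base if total and n / total >= min_fraction else "N")
    return "".join(out)


# Hypothetical top-scoring 8-mers, aligned across a 9-column window.
aligned_8mers = ["-GGGGCGTG", "TGGGGCGT-", "-GGGGCGTA", "AGGGGCGT-"]
print(consensus(position_counts(aligned_8mers)))  # NGGGGCGTN
```

Columns that come out as N in such a summary correspond to the degenerate positions discussed above.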

Most C2H2-ZF modules display degeneracy 

We next asked whether individual modules appeared to 
bind their intended 3-bp subsite. We manually surmised 
the apparent specificity of the module in each instance that 
it was present in a ZFA using the (up to) top 10 DNA 
8-mers and 9-mers that the ZFA preferred, aligned to 
the binding sequence in a way similar to that shown in 
Figure 2 (full tables of aligned 8-mers and 9-mers and 
derived motifs are given in Supplementary Document of 
the Supplementary Data). A summary of this analysis is 
shown in Figure 3. All 38 C2H2-ZF modules present in at
least one successful ZFA are listed, along with their
intended target subsite in each of the 20 successful
ZFAs. Their apparent specificities are colored according
to how closely they resemble the intended target, with
green indicating complete agreement, yellow indicating
degeneracy (but encompassing the intended target), red
indicating disagreement and gray indicating no apparent
contribution to sequence specificity despite being present
in a successful ZFA.

Figure 2. Sequence specificities of ZFAs constructed by modular assembly, as determined by PBM. ID for ZFA and results of assay for activity
follow Ramirez et al. F1, F2 and F3 columns indicate the module numbers used for construction of the ZFA. The rank of the intended 9-mer target
(out of all 131 072 possible 9-mers) is determined by E-score; ME and HK refer to the two array designs used. The last column shows the intended
target (based on the modules used for assembly) compared to PBM results (the sequence motif shown is generated from the (up to) top 10 8-mers
bound by the ZFA, as described in the main text).



This analysis indicates that the majority of the modules 
do recognize either the intended triplet or a degenerate 
version, when embedded in a successful ZFA (Figure 3). 
However, it also underscores the importance of context: of 
the 15 C2H2-ZF modules that are present in more than 
one successful ZFA, only four appear to have precisely the 
same sequence specificity in all contexts. An additional six 
display different levels of degeneracy in different contexts, 
while the remaining five appear to specify at least one base 
differently in different contexts. Nonetheless, degeneracy
is most frequently consistent with flexibility of the 
intended triplet: yellow (degeneracy; 20 instances) is 
more common than red (disagreement; nine instances) or 
gray (no contribution; one instance) in Figure 3. It is also
possible that some of the modules simply have poor 
intrinsic specificity. 

Degeneracy in binding specificities of both artificial ZFAs 
constructed by modular assembly and natural ZFAs 

Degeneracy and context dependence do not seem to be 
incompatible with success of ZFAs in either our assay or 
others: as noted above, all three positive controls (i.e. 
those which Ramirez et al. also scored as successful) dis- 
played some level of degeneracy (Figure 3) (additional 
examples in the literature are noted in the 'Introduction' 
section). ZFA45 in particular, which is one of the positive 
controls, displayed degeneracy at all three positions and 
two of its three constituent modules displayed higher spe- 
cificity in other contexts (Figure 3). Human C2H2-ZF 
modules ('Toolgen' modules in Figure 3) appear to be 
particularly prone to degeneracy and context dependence, 
despite having the highest success rate at producing ZFAs 
with sequence specificity. These observations are of
interest because engineered ZFAs are generally thought to be
most useful when they are as specific as possible (34).

To ask whether degeneracy is a general feature of ZFAs,
we again took advantage of the fact that the PBM assay
yields the number of 8-mers that are significantly preferred
by a given protein, because all 8-mers scoring with E > 0.45
can be considered as significantly preferred (47). Using this
criterion, we previously found that human transcription
factor DNA-binding domains typically have dozens to
hundreds of preferred 8-mers (36). This number is presumably
a property of both the width of the binding site, and
the tolerance for variation at individual bases. Atf4, for
example, has a very specific 8-base binding site, and
yields only a single 8-mer with E > 0.45 (TGACGTCA)
(I. Mann and T.R. Hughes, unpublished data).

The goal of engineered ZFAs is typically to achieve
preference for a single 9-base sequence, which we reason
would correspond to two or fewer highly preferred 8-base
sequences. However, the ZFAs we analyzed typically
yielded dozens of 8-mers with E > 0.45 (Figure 4, top).
This number is comparable to what we previously 
observed with natural human ZFAs (Figure 4, bottom). 
Thus, both natural ZFAs and artificial ZFAs created by 
modular assembly display a level of degenerate binding 
that is comparable to other types of eukaryotic transcrip- 
tion factors. 
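
The reasoning that a single 9-base site corresponds to at most two 8-mers, and the count of significantly preferred 8-mers, can both be sketched in Python. The function names, the example site and the toy E-scores are hypothetical; the 0.45 cutoff is the one used in the text, applied here as "at or above".

```python
def kmers_within(site, k=8):
    """All k-mers contained in a site, in 5'->3' order."""
    return [site[i:i + k] for i in range(len(site) - k + 1)]


def count_preferred(e_scores, threshold=0.45):
    """Number of k-mers whose score is at or above the threshold."""
    return sum(1 for score in e_scores.values() if score >= threshold)


# A perfectly specific 9-mer binder would prefer just these two 8-mers.
print(kmers_within("GATTGGGCG"))  # ['GATTGGGC', 'ATTGGGCG']

# Toy E-scores (made-up values) for a more degenerate binder.
toy_e_scores = {
    "GATTGGGC": 0.49,
    "ATTGGGCG": 0.48,
    "GATTGGGA": 0.46,
    "CATTGGGC": 0.45,
    "AAAAAAAA": -0.20,
}
print(count_preferred(toy_e_scores))  # 4
```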

GNN C2H2-ZF modules have the highest success rate 

Finally, we re-examined the conclusion of Ramirez et al. 
that GNN C2H2-ZF modules account for most of the 
success of engineered ZFAs. Indeed, consistent with the 
findings of Ramirez et al., we observed that the success of 
ZFAs in PBMs is lowest for those that lack GNN modules
(Figure 5A). Our success rates are notably higher than 
those of Ramirez et al., particularly for those with two 
GNN subsites, where we obtained 100% success. The spe- 
cificity of individual modules within the 20 successful 
ZFAs is also highest for GNN subsites (Figure 5B), 
which specified an exact match to the intended triplet 
(i.e. no degeneracy) in 27 of 50 instances. Most of the 
eight ANN modules present in successful ZFAs also 
specified either an exact (three cases) or degenerate (four 
cases) match to the intended triplet. In contrast, the one 
CNN module present in a successful ZFA made no 
apparent contribution to sequence specificity. The one 
TNN module present in a successful ZFA did contribute 
to sequence specificity, but specified NGG instead of 
TGG. 



DISCUSSION 

Our analysis shows that modular assembly of C2H2-ZFs 
into ZFAs does not result in overwhelming failure with 
respect to obtaining proteins that bind DNA in a 
sequence-specific manner. The poor behavior of 
non-GNN modules (especially CNN and TNN 
modules), which may be explained by reasons outlined 
in the Introduction, does appear to account for many if 
not most of the failures in the PBM assay. Since most of 
the currently available CNN and TNN modules are 
derived from C2H2-ZFs that prefer GNN (or GNN-G), 
it is possible that the low success rates obtained with them
are a property of the modules, rather than a property of the
modular assembly procedure.

We propose several possible explanations for the 
apparent discrepancy between our conclusions and those 
of Ramirez et al. The most obvious is that the PBM assay 
can detect binding to sequences that are different from the 
intended targets, whereas all of the assays in Ramirez et al. 
tested only a single intended target sequence. However, 
when we specifically asked whether the intended target
9-mer is highly preferred in the PBM assay, we found
that it was often very highly ranked. Deviation in the 
actual versus intended sequence specificity can only 
explain approximately 1/3 of all cases where we scored a 
success and Ramirez et al. did not. 

A second possible explanation is that the sensitivity of 
the PBM assay may be higher than that of other assays. 
B2H fold activation scales roughly with affinity of the 
ZFA, with a threshold of ~100 nM (35). In the PBM
assay, the protein concentration is typically ~100 nM
before washing, but the microarray probes have a very 
high local concentration at the surface of the array, 
which may facilitate re-binding. The PBM assay also 
does not require high specificity to a single 9-mer 
sequence; in previous analyses we and others have used 
PBMs to determine sequence preferences of proteins that 
bind well to many 8-mers [e.g. (36)]. Cornu et al. (34) 
found for several ZFAs that sequence specificity is import- 
ant for ZFN function. However, in our analysis, positive 
controls selected from Ramirez et al. appeared to possess 
at least some degeneracy in their binding specificity, 
indicating that the B2H assay is compatible with some 
degenerate binding. 

A third possibility is that multiple parameters determine 
success of ZFAs in the assays used by Ramirez et al. (and 
success as ZFNs), and that there is not a direct linear 
mapping between any single property of the protein 
(including its sequence specificity) and its performance in 
these assays. Properties of proteins that determine success 
in in vivo assays with heterologous fusion constructs could 
conceivably include expression level and solubility, as well 
as unanticipated protein-protein and protein-RNA inter- 
actions, both of which C2H2-ZFs can mediate (52). In 
addition, DNA sequence specificity itself can be defined 
and described in different ways, including relative prefer- 
ence for target versus random sequence, and tolerance to 
degeneracy in the target sequence. Consistent with a rela- 
tively poor relationship between sequence specificity 
in vitro and nuclease targeting capacity in vivo, Kim 
et al. (51) recently reported that 44% of ZFN pairs displayed
restriction activity in vitro, but only 7% (23/315)
yielded activity in a cell culture assay.

An additional consideration underscored by our study 
is that the expectation that an artificial ZFA created by 
modular assembly will generally have exclusive specificity 
for a single 9-mer may be unrealistic. High specificity of 
ZFNs is believed to be desirable (34), but it is in fact 
typical for C2H2-ZFs found in nature to prefer a set of 
variants of a sequence motif [e.g. (36)]. This property (de- 
generacy) is apparently shared by artificial ZFAs created 
by modular assembly. To our knowledge, the individual 
C2H2-ZF modules used here have not been previously 
characterized for their relative preference to all possible 
3-mers in multiple contexts, and rules dictating the effects 
of interactions among adjacent C2H2-ZF modules are 
poorly understood at best. Therefore, it is difficult to say 
what should have been anticipated from our experiments. 
On the basis of our results, however, it appears that ex- 
tremely high specificity may not be a general property of 
the C2H2-ZF domain. Indeed, such strong sequence spe- 
cificity is not a feature of most eukaryotic TFs (36,48), and 
the regulatory and evolutionary strategies of metazoan 
genomes may even rely on flexible assemblies of relatively 
promiscuous binding factors (53,54). 

The fact that modular assembly of ZFAs is successful in 
the majority of cases in our analysis, and using our success 
criteria — notwithstanding CNN and TNN modules, 
which for reasons already outlined deserve further exam- 
ination — also supports the potential for C2H2-ZF 
modular assembly as an evolutionary mechanism (14). 
We further propose that the typically degenerate 
sequence specificity of individual C2H2-ZFs, and their 
frequent context dependency within ZFAs, may represent 
a beneficial evolutionary property. We note that this 
feature of ZFAs is not inconsistent with the general 
concept of modularity, as discussed in the Introduction. 
In any case, in 19 of the 20 successful ZFAs in our 
analysis, it is easy to manually align the high-scoring 
8-mers and 9-mers (and the resulting motifs) to the 
intended 9-mer target, and most of the modules do 
behave approximately as intended (i.e. most are colored 
green or yellow in Figure 3). 

Our findings also highlight the importance of 
characterizing or predicting the sequence preferences of 
individual C2H2-ZFs, and using them to infer the 
binding sites of artificial and natural ZFAs (15-17), 
which would be less relevant (or at least more 
complicated) if the assumption of modularity were gener- 
ally untrue. Ultimately, efforts to understand and predict 
the sequence specificities of ZFAs with high accuracy will 
require a more complete characterization of individual 
C2H2-ZFs, including their sequence preferences outside 
the canonical triplet, as well as a better grasp of the influ- 
ence of inter-finger interactions. Nonetheless, despite the 
degeneracy of most C2H2-ZF DNA-binding activities, 
and the influence of context, the intended 9-mer target 
typically ranks very highly in the PBM data, and other 
high-scoring sequences usually bear an obvious relation- 
ship to the intended 9-mer. A simple table of the most 
preferred triplet for all individual natural ZFs would
thus be extremely useful even if degeneracy and context
were ignored.

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online. 